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Abstract 

Markov chain Monte Carlo methods are often deemed too computationally intensive to 
be of any practical use for big data applications, and in particular for inference on datasets 
containing a large number n of individual data points, also known as tall datasets. In scenar¬ 
ios where data are assumed independent, various approaches to scale up the Metropolis- 
Hastings algorithm in a Bayesian inference context have been recently proposed in ma¬ 
chine learning and computational statistics. These approaches can be grouped into two 
categories: divide-and-conquer approaches and, subsampling-based algorithms. The aims 
of this article are as follows. First, we present a comprehensive review of the existing 
literature, commenting on the underlying assumptions and theoretical guarantees of each 
method. Second, by leveraging our understanding of these limitations, we propose an orig¬ 
inal subsampling-based approach which samples from a distribution provably close to the 
posterior distribution of interest, yet can require less than 0{n) data point likelihood eval¬ 
uations at each iteration for certain statistical models in favourable scenarios. Finally, we 
have only been able so far to propose subsampling-based methods which display good per¬ 
formance in scenarios where the Bernstein-von Mises approximation of the target posterior 
distribution is excellent. It remains an open challenge to develop such methods in scenarios 
where the Bernstein-von Mises approximation is poor. 
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1 Introduction 

Performing inference on tall datasets, that is datasets containing a large number n of individ¬ 
ual data points, is a major aspect of the big data challenge. Statistical models, and Bayesian 
methods in particular, commonly demand Markov chain Monte Carlo (MCMC) algorithms to 
make inference, yet running MCMC on such tall datasets is often far too computationally in¬ 
tensive to be of any practical use. Indeed, MCMC algorithms such as the Metropolis-Hastings 
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(MH) algorithm require at each iteration to sweep over the whole dataset. Frequentist or varia¬ 
tional Bayes approaches arc thus usually preferred to a fully Bayesian analysis in the tall data 
context on computational grounds. However, they might be difficult to put in practice or justify 
in scenarios where the likelihood function is complex; e.g. non-differentiable (Chernozhukov 
& Hong, 2003). Moreover, some applications require precise quantification of uncertainties 
and a full Bayesian approach might be preferable in those instances. This is the case for exam¬ 
ple for applications from experimental sciences, such as cosmology (Trotta, 2006) or genomics 
(Wright, 2014), were such big data problems abound. Consequently, much efforts have been 
devoted over recent years to develop scalable MCMC algorithms. These approaches can be 
broadly classified into two groups: divide-and-conquer approaches and subsampling-based al¬ 
gorithms. Divide-and-conquer approaches divide the initial dataset into batches, run MCMC on 
each batch separately, and then combine these results to obtain an approximation of the poste¬ 
rior: Subsampling approaches aim at reducing the number of individual data point likelihood 
evaluations necessary at each iteration of the MH algorithm. 

After briefly reviewing the limitations of MCMC for tall data, introducing our notation and 
two running examples in Section 2, we first review the divide-and-conquer literature in Sec¬ 
tion 3. The rest of the paper is devoted to subsampling approaches. In Section 4, we discuss 
pseudo-marginal MH algorithms. These approaches are exact in the sense that they target the 
right posterior distribution. In Section 5, we review other exact approaches, before relaxing ex¬ 
actness in Section 6. Throughout, we focus on the assumptions and guarantees of each method. 
We also illustrate key methods on two running examples. Finally, in Section 7, we improve over 
our so-called confidence sampler in (Bardenet et al. , 2014), which samples from a controlled 
approximation of the target. We demonstrate these improvements yield significant reductions in 
computational complexity at each iteration in Section 8. In particular, our improved confidence 
sampler can break the 0(n) bander of number of individual data point likelihood evaluations 
per iteration in favourable cases. Its main limitation is the requirement for cheap-to-evaluate 
proxies for the log-likelihood, with a known error. We provide examples of such proxies relying 
on Taylor expansions. 

All examples can be rerun or modified using the companion IPython notebook 1 to the paper, 
available as supplementary material. 

2 Bayesian inference, MCMC, and tall data 

In this section, we describe the inference problem of interest and the associated MH algorithm. 
We also detail the two running examples on which we benchmark key methods in Section 4, 5 
and 6. 

2.1 Bayesian inference 

Consider a dataset 

X = {x u ... 

'The IPython notebook and a static 
http://www.2020science.net/research/scaling-mcmc-methods. 


,Xn}CXcR d , (1) 

html render of it can both be found at 
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and a parameter space 0. We assume the data are conditionally independent with associated 
likelihood YYl=iP{ x i\0) given a parameter value 6 and we denote 1(0) the associated average 
log-likelihood 

. n -i n 

t{0) = - log p(xi\6) = -Y'£i(0). ( 2 ) 

Z—1 z=l 

We follow a Bayesian approach where one assigns a prior p(0) to the unknown parameter, so 
that inference relies on the posterior distribution 

7 t(6) = p(0\x) oc 7 (0) = p(6)e n ^ d \ (3) 

where 7 denotes an unnormalized version of n. In most applications, 7 r is intractable and we 
will focus here on Markov chain Monte Carlo methods (MCMC; Robert & Casella, 2004) and, 
in particular, on the Metropolis-Hastings (MH) algorithm to approximate it. 


2.2 The Metropolis-Hastings algorithm 

A standard approach to sample approximately from 7 t(9) is to use MCMC algorithms. To 
illustrate the limitation of MCMC in the tall data context, we focus here on the MH algorithm 
(Robert & Casella, 2004, Chapter 7.3). The MH algorithm simulates a Markov chain (9k)k>o of 
invariant distribution 7 r. Then, under weak assumptions, see e.g. (Douc et al. , 2014, Theorem 
7.32), the following central limit theorem holds for suitable test functions h 


\/ ^Viter 


1 

27iter 


-Niter 


k=0 


h(6)ir(6)d6 




(4) 


where convergence is in distribution. 


MH( 7 (-), q(.\.), 9 0 , -/Viter) 


1 for k <— 1 to Alter 

2 6 1, 

3 O' ~ q(.\0), 

4 u ~ W( 07 ) 


5 a (Q 8'\ li?l x 

6 if u < c*( 0 , 0 ') 

7 Ok ^ O' > Accept 

8 else 0k t— 0 > Reject 

9 return (0 fe ) fe= i,...,iv iBr 


Figure 1: The pseudocode of the MH algorithm targeting the distribution 7r. Note that 7r is only 
involved in ratios, so that one only needs to know an unnormalized version 7 of it. 
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The pseudocode of MH targeting a generic distribution 7 r is given in Figure 1 . In the case 
of Bayesian inference with independent data (3), Step 5 is equivalent to setting 



+ n [£(6') — i{9)\ . 


(5) 


When the dataset is tall (n S> 1), evaluating the log likelihood ratio in (5) is too costly an 
operation and rules out the applicability of such a method. As we shall see, two possible options 
are to either divide the dataset into tractable batches, or approximate the acceptance ratio in (5) 
using only part of the dataset. 

2.3 Running examples 

We will evaluate some of the described approaches on two illustrative simple running examples. 
We fit a one-dimensional normal distribution p(-\n, a) = a 2 ) to 10 5 i.i.d. points drawn 

according to Xi ~ jV(0,1) and lognormal observations X- L ~ logAA(0,1), respectively. The 
latter example illustrates a misspecification of the model. We assign a flat prior p(//. log a) oc 
1. For all algorithms, we start the chain at the maximum a posteriori (MAP) estimate. The 
MH proposal is an isotropic Gaussian random walk, whose stepsize is first set proportional to 
1 /y/n and then adapted during the first 1 000 iterations so as to reach 50% acceptance. When 
applicable, we also display the number of likelihood evaluations per iteration, and compare it 
to the n evaluations required at each iteration by the MH algorithm. 

In Figure 2, we illustrate the results of 10 000 iterations of vanilla MH on each of the 
two datasets. MH does well, as the posterior coincides with that of a longer reference run of 
50 000 iterations in each case, and the autocorrelations show a fast exponential decrease. The 
Bernstein-von Mises approximation (van der Vaart, 2000, Chapter 10.2), a Gaussian centered 
at the true value, with covariance minus the scaled inverse Fisher information, is a very good 
approximation to the posterior in both cases. We are thus in simple cases of heavy concentration 
of the posterior, where subsampling should help a lot if it is to be of any help in tackling tall 
data problems. 

3 Divide-and-conquer approaches 

A natural way to tackle tall data problems is to divide the data into batches, run MH on each 
batch separately, and then combine the results. 

3.1 Representing the posterior as a combination of batch posteriors 

Assume data X are divided in B batches xi,..., xg. Relying on the equality 


B 


p(6\X) oc \\p{0) l,B p{yLi\6), 


( 6 ) 


Huang & Gelman (2005) propose to combine the batch posterior approximations using Gaus¬ 
sian approximations or importance sampling. Scott et al. (2013) propose to average samples 
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(a) Chain histograms, X % ~ A/”(0,1) 
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(b) Chain histograms, X % ~ logA/”(0,1) 




(c) Autocorr. of log a, Xi ~ A/”(0,1) 


(d) Autocorr. of logcr, Xi ~ logA/”(0,1) 


Figure 2: Results of 10 000 iterations of vanilla MFI fitting a Gaussian model to one-dimensional 
Gaussian and lognormal synthetic data, on the left and right panel, respectively. Figures 2(a) 
and 2(b) show the chain histograms, joint and marginals; the x-axis corresponds to the mean of 
the fitted Gaussian, the y-axis to the standard deviation. We have superimposed a kernel density 
estimator of a long MH chain for reference in green and the Bemstein-von Mises Gaussian 
approximation in red. Figures 2(c) and 2(d) show the marginal autocorrelation of logcr in 
blue. The green curves are baselines that correspond to the long MH reference run depicted in 
green in the top panel; although the green autocorrelation functions are of limited interest when 
comparing them to vanilla MH, we use them as reference in all similar later figures. 
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across batches, noting this is exact under Gaussian assumptions. Neiswanger et al. (2014) 
propose to run an MCMC chain on each batch x, targeting an artificial batch posterior 

7Ti(e) OC p(0) 1 / B p(x i \&), 

fit a smooth approximation to each batch posterior, and multiply them. These methods arc 
however theoretically justified only when batch posteriors are Gaussian, or when the size of 
each batch goes to infinity, to guarantee that the used smooth approximation of each batch 
posterior is accurate. 

There are few results available on how the properties of combined estimators scale with the 
number of batches B. Neiswanger et al. (2014) fit a kernel density estimator to the samples 
of each batchwise chain, and multiply the resulting kernel density estimators. A sample from 
this mixture approximation to n is then obtained through an additional MCMC step. Under 
simplifying assumptions (all MCMC chains are assumed being independent draws from their 
targets, for example), a bound on the MSE of the final estimator is obtained. However, this 
bound explodes as the kernel bandwidth goes to zero, and more importantly, it is exponential in 
the number of batches B. In a tall data context, the number of batches is expected to grow with 
n to ensure that the size of each batch is less than O(ri'). Thus, the proposed bound is currently 
not informative for tall data. 

As pointed out by Wang & Dunson (2013), if the supports of the tt, are almost disjoint, 
then the product of their - approximations will be a poor approximation to tt. To improve the 
overlap between the approximations of the irfs, Wang & Dunson (2013) propose to replace the 
posterior in (6) by the product of the Weierstrass transforms of each batch posterior. When the 
approximation of tt, is an empirical measure, its Weierstrass transform corresponds to a kernel 
density estimator. The product of the Weierstrass transforms can be interpreted as the marginal 
distribution of an extended distribution on (-) 11 * 1 , where the first B copies of 6 are associated 
to the B batches, and the remaining copy is conditionally Gaussian around a weighted mean of 
the first B copies. Unfortunately, sampling from the posterior of this artificial model is difficult 
when one only has access to approximate samples of each 7ly. 

Although it is not strictly speaking a Monte Carlo method, we note that Xu et al. (2014) 
and Gelman et al. (2014) propose an expectation-propagation-like algorithm that similarly 
tackles the issue of disjoint approximate batch posterior supports. Each batch of data points is 
represented by its individual likelihood times a cavity distribution. The cavity distribution is 
itself the product of the prior and a number of terms that represent the contributions of other 
batches to the likelihood. The algorithm iterates between 1) simulating from each batchwise 
likelihood times a batch-specific cavity distribution, and 2) fitting each batch-specific cavity 
component. Again, while these approaches are computationally feasible and appear to perform 
well experimentally, it is difficult to assert the characteristics of the proposed approximation of 
the posterior and there is no convergence guarantee available for this iterative algorithm. 

3.2 Replacing the posterior by a geometric combination of batch posteriors 

Another avenue of research consists in avoiding multiplying the batch posteriors by replacing 
the target by a different combination of the latter. 
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By introducing a suitable metric on the space of probability measures such as the Wasser- 
stein metric, it is for example possible to define the barycenter or the median of a set of probabil¬ 
ity measures. Minsker et al. (2014) propose to output the median of the batch posteriors, while 
Srivastava et al. (2014) use the Wasserstein barycenter, which can be computed efficiently in 
practice using the techniques developed by Cuturi & Doucet (2014). While this idea has some 
appeal, the statistical meaning of these median or mean measures is unclear, and the robustness 
of the median estimate advocated in (Minsker et al. ,2014) may also be a drawback, as in some 
circumstances valuable information contained in some batches may be lost. 

To conclude, divide-and-conquer techniques appear as a natural approach to handle tall data. 
However, the crux is how to efficiently combine the batch posterior approximations. The main 
issues arc that the batch posterior approximations potentially have disjoint supports, that the 
multiplicative structure of the posterior (3) leads to poor scaling with the number of batches, 
that theoretical guarantees are often asymptotic in the batch size, and that cheap-to-sample 
combinations of batch posteriors are difficult to interpret. 

4 Exact subsampling approaches: Pseudo-marginal MH 

Pseudo-marginal MH (Lin et al. , 2000; Beaumont, 2003; Andrieu & Roberts, 2009) is a variant 
of MH, which relies on unbiased estimators of an unnormalized version of the target. Pseudo¬ 
marginal MH is useful to help understand several potential approaches to scale up MCMC. We 
start by describing pseudo-marginal MH in Section 4.1. Then, we present two pseudo-marginal 
approaches to tall data in Section 4.2 and Section 4.3. 

4.1 Pseudo-marginal Metropolis-Hastings 

Assume that instead of being able to evaluate 7 (9), we have access to an unbiased, almost-surely 
non-negative estimator 7 (0) of the unnormalized target 7 (6). Pseudo-marginal MH substitutes 
a realization of 7 (O') to 7 {O') in Step 5. Similarly, it replaces 7 {6) in Step 5 by the realization of 
7 (9) that was computed when the parameter value 6 was last proposed. Pseudo-marginal MH is 
of considerable practical importance, with applications such as particle marginal MH (Andrieu 
et al. , 2010) and MCMC versions of the approximate Bayesian computation paradigm (Marin 
et al. , 2012). It is thus worth investigating its use in the context of tall data problems. 

The possibility to use an unbiased estimator of 7 comes at a price: first, the asymptotic vari¬ 
ance crj? m in (4) of an MCMC estimator based on a pseudo-marginal chain will always be larger 
than that of an estimator based on the underlying “marginal” MH (Andrieu & Vihola, 2015). 
Second, the qualitative properties of the underlying MH may not be preserved, meaning that the 
rate of convergence to the invariant distribution may go from geometric to subgeometric, for in¬ 
stance; see Andrieu & Roberts (2009) and Andrieu & Vihola (2015) for a detailed discussion. In 
practice, if the variance of 7 ( 1 )) is large for some value 4 G 0, then an MH move to r) might be 
accepted while 7 ( 1 ?) largely overestimates 7(7)). In that case, it is difficult for the chain to leave 
?), and pseudo-marginal MH chains thus tend to get stuck if the variance of the involved esti¬ 
mators is not controlled. When some tunable parameter allows to control this variance, Doucet 
et al. (2015) show that, in order to minimize the variance of MCMC estimates for a fixed com- 


putational complexity, the variance of the log-likelihood estimator should be kept around 1.0 
when the ideal MH having access to the exact likelihood generates quasi-i.i.d samples from 7r; 
or set to around 3.0 when it exhibits very large integrated autocorrelation times. In practice, the 
integrated autocorrelation times of averages under the ideal MH are unknown as this algorithm 
cannot be implemented. In this common scenario, Doucet et al. (2015) recommend keeping 
the variance around 1.5 as this is a value which ensures a small penalty in performance even in 
scenarios where 1.0 or 3.0 are actually optimal. They also show that the penalty incurred for 
having a variance too small (i.e. inferior to 0.2) or too large (i.e. superior to 10) is very large. 
When mentioning pseudo-marginal MH algorithms, we will thus comment on the variance of 
the logarithm of the involved estimators 7 (9), or, if not available, of their relative variance. 

4.2 Unbiased estimation of the likelihood using unbiased estimates of the log- 
likelihood 

As described in Section 4.1, pseudo-marginal MH requires an almost-surely nonnegative un¬ 
biased estimator 7 (6) of the unnormalized posterior at 6, for any 9 in 0. It is easy to check 
that, by sampling from the dataset X with or without replacement, we obtain the 

following unbiased estimator of the log-likelihood n£{9) 

t 

n£{9) = 7 5^ lo gp(®i|0)- (7) 

i= 1 

We denote by £(9) the subsampling estimate of the average log-likelihood and denote by cr t.(9) 2 
its variance. Obviously, exponentiating (7) does not provide an unbiased estimate of the like¬ 
lihood e n ^ 0> . However, an interesting question is whether one can design a procedure which 
outputs an unbiased, almost-surely nonnegative estimate of e n ^ using unbiased estimates of 
n£{6) such as n£{9). Without making any further assumption about n£(9), it was recently 
shown by Jacob & Thiery (2013) that it is not possible. However, this can be done if one fur¬ 
ther assumes, for instance, that there exists a(9) such that £,(0) > a{6) for all i, see (Jacob 
& Thiery, 2013, Section 3.1) who rely on a technique by Rhee & Glynn (2013) generalizing 
(Bhanot & Kennedy, 1985). Unfortunately, as we shall see, the resulting estimator 7 (6) typ¬ 
ically has a very large relative variance, resulting in very poor performance of the associated 
pseudo-marginal chain. 

We apply (Rhee & Glynn, 2013, Theorem 1) to build an unbiased non-negative estimator of 
7 (9)/p(9), which is equivalent to defining 7 (9). For j > 1, let 

t. 

D *j = 7 5^ lo gp(sij|0) - na(9), (8) 

2=1 

be an unbiased estimator of n (£(9) — a{9)), where the x*j’s are drawn with replacement from 
X for each i, and are further independent across j. In (Jacob & Thiery, 2013, Section 3.1), 
N is an integer-valued random variable whose tails do not decrease too fast, in the sense that 
P (N > k) >(7(1 + e:) / - T° ease computations, we take N to be geometric with parameter 
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e/(l + e). This corresponds to the lightest tails allowed by (Jacob & Thiery, 2013, Section 3.1), 
since F(N >&) = (! + e)~ k . Finally, let 


Y = e na ( e ) 




1 1 

P(V > k) k\ 



(9) 


By (Rhee & Glynn, 2013, Theorem 1), Y is a non-negative unbiased estimator of the like¬ 
lihood e n ^ e \ As mentioned in Section 4.1, it is crucial, if we want to plug 7 (0) = Y x p(9) in 
a pseudo-marginal algorithm, to control the variance of its logarithm. The variance of log Y is 
difficult to compute, so we use here the relative variance of Y as a proxy. 


Proposition 4.1 Let 0 7 0 and Y be the almost surely non-negative estimator ofedefined 
in (9). Then its relative variance satisfies 


Vary 

e 2ni{0) 


> 


e -2n(£(0)-a(0))+2ny / (1+e)[<T t (0) 2 +(4(0)-a(0)) 2 ] 

n^(l + e)[a t (d)i + m-a(OW} 


+ 0 ( 1 ). 


( 10 ) 


The proof of Proposition 4.1 can be found in Appendix A. We can interpret ( 10) as follows: 
in order for the relative variance of Y not to increase exponentially with n, it is necessary that 
nofiO) is of order 1. But oy(0) if of order V 1 / 2 , so that the batchsize t would have to be of 
order n 2 , which is impractical. It is also necessary that \Jl + e is of order 1 + n~ 1 to control 
the term in (£(0) — a(8)). This means that e should be taken of order n -1 , but then the mean 
(1 + e)e^ 1 of the geometric variable N will be of order n. This entails that the number of terms 
in the randomly truncated series (9) should be of order n, which defeats the puipose of using 
this estimator. 

Flence for the reasons outlined in Section 4.1, we expect the pseudo-marginal MH relying 
on y to be highly inefficient. Indeed, we have not been able to obtain reasonably mixing chains 
even on our Gaussian running example. We have experimented with various choices of e, and 
with various values of t, but none yielded satisfactory results. We conclude that this approach 
is not a viable solution to MH for tall data. 

We note that Strathmann et al. (2015) have recently proposed a different way to exploit the 
methodology of Rhee & Glynn (2013) in the context of tall data. However, their methodology 
does not provide unbiased estimates of the posterior expectations of interest. It only provides 
unbiased estimates of some biased MCMC estimates of these expectations, these MCMC es¬ 
timates corresponding indeed to running an MCMC kernel on the whole dataset for a finite 
number of iterations. Strathmann et al. (2015) suggest that it might be possible to combine 
their algorithm with the recent scheme of Glynn & Rhee (2014) to obtain unbiased estimates 
of the posterior expectations. It is yet unclear whether this could be achieved under realistic 
assumptions on the MCMC kernel. 


4.3 Building 7 ( 6 ) with auxiliary variables 

In (MacLaurin & Adams, 2014), the authors propose an alternative MCMC to sample from n 
which, similarly to the method described previously, only requires evaluating the likelihood of 
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a subset of the data at each iteration. Assume a bound £i(9) > bi{9) is available for each i and 
9. For simplicity, we further assume that bi{9) = b(9, x r ) only depends on i through X{. This is 
the case in the experiments of (MacLaurin & Adams, 2014), as well as ours. Note also that in 
Section 4.2, we used a bound that was uniform in the data index i; we could have used similarly 
a non-uniform bound, but this would have made the derivation of Proposition 4.1 unnecessarily 
heavy. 

As noted in (MacLaurin & Adams, 2014), we can then define the following extended target 
7 f distribution on© x {0,l} n 

n 

ft(9, z) oc p{6) [exp (£i(9) - exp (bi(0))] Zi exp(bi{9)) l ~ Zi 

i= 1 

n n 

= p{9) n ex p(^(6*)) n [ exp (^*( 0 ) - b *(9)) - M zi ■ (H) 

i=l i =1 

This distribution satisfies two important features: it admits 7 r(9) as a marginal distribution, and 
its pointwise evaluation only requires to evaluate £i(9) for those V s for which z% = 1. Note 
that evaluating ft(0,z ) however requires to evaluate F[" | exp(fr,■(#)), and the bounds b,(9) 
thus must be chosen so that this computation is cheap. This is the case for the lower bound of 
the logistic regression log-likelihood model discussed in (MacLaurin & Adams, 2014), which 
is a quadratic form in ti9 J Xi, where ti is the ±1 label of datum X{ . The idea of replacing 
the evaluation of the target by a Bernoulli draw and the evaluation of a lower bound has been 
exploited previously; see e.g. (Mak, 2005). 

Any MCMC sampler could be used to sample from 7 r. MacLaurin & Adams (2014) propose 
an MFl-within-Gibbs sampler that leverages the known conditional tt(z\9). The expected cost 
of one conditional MH iteration on 9 at equilibrium, that is the average number of indices i 
such that Z{ = 1, is 0(n), and the constant is related to the expected relative tightness of the 
bound, see (MacLaurin & Adams, 2014, Section 3.1). The number of likelihood evaluations 
for an update of z conditional on 9 is explicitly controlled in (MacLaurin & Adams, 2014) by 
either specifying a maximum number of attempted flips, or implicitly specifying the fraction of 
flips to 1. 

The authors of MacLaurin & Adams (2014) remarked that their methodology is related to 
pseudo-marginal techniques but did not elaborate. We show here how it is indeed possible to 
exploit the extended target distribution 7r in (11) to obtain an unbiased estimate of an unnormal¬ 
ized version of 7 r. More precisely, we have 

p(xi\0) = p{xi,Zi\6) 

Zi£{ 0 , 1 } 

where p (zi\9, Xi) = {1 — exp (bi(9) — £i(9))} Zi exp(bi(9) — £i(9)) 1 ~ Zi . Hence, the marginal 
distribution of z. t under this extended model is given by 

p(zi = l\9) = J p(zi = l,Xi\6)dxi 

= J [exp(^(0)) - exp(6j(0))] dx t 
= l~Ie, (12) 
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where Ig = f exp (b(9, x))dx. Using Bayes’ theorem, we obtain accordingly 



An obvious unbiased estimator of the unnormalized posterior is thus given by 


n 


7(0) =p(0)~[]_p(xi\0,Zi) 


( 13 ) 


i— 1 


where each Zi is drawn independently given 0 from (12). Note that in the case of logistic 
regression, if bi{6) is chosen to be the quadratic lower bound given in (MacLaurin & Adams, 
2014), its integral Ig is a Gaussian integral and can thus be computed. Finally, similarly to the 
Firefly algorithm of MacLaurin & Adams (2014), the number of evaluations of the likelihood 
per iteration is nig, loosely speaking. 

Although the pseudo-marginal variant of Firefly we propose has the disadvantage of requir¬ 
ing the integrals Ig to be tractable, it comes with two advantages. First, the sampling of 2 does 
not require to evaluate the likelihood at all. If computing all bounds does not become a bottle¬ 
neck, this avoids the need to explicitly state a resampling fraction at the risk of augmenting the 
variance of the likelihood estimator. Second, the properties of this variant are easier to under¬ 
stand, as it is a ‘standard’ pseudo-marginal MH and hence the results from Section 4.1 apply. 
In particular, although it has the correct target distribution, the asymptotic variance of ergodic 
averages is inflated compared to the ideal algorithm. 

As explained in Section 4.1, we consider the variance of the log likelihood estimator. 

Proposition 4.2 Let 0 £ 0. With the notations introduced in Section 4.3, 



n 


n 


(14) 


The proof of Proposition 4.2 can be found in Appendix B. Proposition 4.2 can be interpreted 
as follows: the variance is related to how tight the bound is. In general, obtaining a variance of 
order 1 will only be possible if most bounds bi(8) are very tight, and the bigger n, the tighter the 


bounds have to be. These conditions will typically not be met when a fixed fraction of “outlier” 


xd s correspond to untight bounds. 

We give the results of the original Firefly MH on our running Gaussian and log normal 
examples in Figure 3. We bound each using a 2nd order Taylor expansion at the MLE and 
the Taylor-Lagrange inequality, see Section 7.2.1 for further details. This bound is very tight 
in both cases, so that we are in the favourable case where only a few components of 2 are 1 at 
each iteration, and the number of likelihood evaluations per full joint iteration is thus roughly 
the fraction of points for which 2 , has been resampled. We chose the fraction of resampled 
points to be 10% here, and initialized 2 to have 10% of ones. Trying smaller fractions led to 
very slowly mixing chains even for the Gaussian example. Estimating the number of likelihood 
evaluations per full joint iteration as the sum of the number of resampled zd s and the number 
of “bright” points, we obtained in both the Gaussian and lognormal case an almost constant 
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number of likelihood evaluations close to 10%, so that only a few points are bright. This can be 
explained by the tightness of the Taylor bound, which leads Firefly MH to almost exclusively 
replace the evaluation of the likelihood by that of the Taylor bound. Finally, unlike the other 
algorithms we applied, we observed that a bad choice of the initial value of z can easily take 6 
out of the posterior mode. To be fair - , we thus discarded the first 1 000 iterations as a burn-in 
before plotting. 

As expected, the algorithm behaves erratically in the lognormal case, as failure to attempt 
a flip of each Z{ draws the //-component of the chain towards the few large values of (x, — p,) 2 
which are bright. Since the bright points are rarely updated, the chain mixes very slowly. 


5 Other exact approaches 

Other exact approaches have been proposed, which do not rely on pseudo-marginal MH. 


5.1 Forgetting about acceptance: stochastic approximation approaches 

Welling & Teh (201 1) proposed an algorithm based on stochastic gradient Langevin dynamics 
(SGLD). This is an iterative algorithm which at iteration k + 1 uses the following update rule 

t 


0k+l ~ + 


Cfc+1 


V log p(0) + J V log p(x* k 1 6) 


2=1 


+ V e k+iVk+i, (15) 


(e*;) is a sequence of time steps, ( rj k ) are independent jV(0, I c j) vectors and 


^ ^2 V logp(xl k \9) 


is an unbiased estimate of the score computed at each iteration using a random subsample 
{<,} of the observations. This approach is reminiscent of the Metropolis-adjusted Langevin 
algorithm (MALA; Robert & Casella 2004, Section 7.8.5), where the proposal given by 


e' = e + 


e 

2 


n 

viog p{o) + ^viogpfoie) 
2=1 


+ \fen, 


is used in an MH acceptance step, where e ~ jV(0, Id)- The point of Welling & Teh (2011) is 
that if one suppresses the MH acceptance step, computes an unbiased estimate of the score but 
introduces a sequence of stepsizes (e k ) that decreases to zero at the right rate, then 

/ Wei \ ' Tier 

€k ) X! 

\k =0 / k =0 


is an approximation to ir. The algorithm has been analyzed recently in (Teh et al. , 2014), where 
it has been established that it provides indeed a consistent estimate of the target. Additionally, 
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Figure 3: Results of 10000 iterations of Firefly MH (MacLaurin & Adams, 2014) on our 
Gaussian and lognormal running examples. See Section 4.3 and the caption of Figure 2 for 
details. 
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a central limit theorem holds with convergence rate iV iter , which is slower than the traditional 
Monte Carlo rate N { ' . It is yet unclear - how SGLD compares to other subsampling schemes 
in theory: it may require a smaller fraction of the dataset per iteration, but more iterations are 
needed to reach the same accuracy. 

In practice, we show the results of SGLD on our two running examples in Figure 5.1. 
The stepsize e:/. is chosen proportional to k~ l 3 , following the recommendations of Teh et al. 
(2014). We show the results of two choices for the subsample size t: 10% and 1% of the 
data, with respectively 10 000 and 100 000 iterations, so that both runs amount to the same 10% 
fraction of the budget of the vanilla MH in Figure 2. Both runs are still far from convergence on 
the lognormal example: subsampling draws the chain away from the support of the posterior, 
and one has to wait for smaller stepsizes to avoid overconfident moves. But then, the variance 
of the final estimate gets bigger. Constant stepsizes lead to comparable results (not shown). 

Finally, we note that subsampling for Hamiltonian Monte Carlo (HMC; Duane et al. , 1987) 
has also been recently considered. Chen et al. (2014) propose a modification of HMC that is in¬ 
spired by the SGLD with decreasing stepsize of Welling & Teh (201 1), while Betancourt (2014) 
explores why naive approaches suffer from unacceptable biases. The algorithm of Chen et al. 
(2014) is however a heuristic, which further relies on the subsampling noise being Gaussian. 
As demonstrated in (Bardenet et al. , 2014) and in Section 6.2, relying on a Gaussian noise 
assumption can yield arbitrarily poor performance when this assumption is violated. 

5.2 Delayed acceptance 

Banterle et al. (2015) remarked that if we decompose the acceptance ratio in a product of 
positive functions 

B 

a(6,6’) = l[p i (d,e’) 

2—1 

then the MH-like algorithm that accepts the move from 0 to O' with probability 

B 

0 rn in [pi(0,0'), l] 

2—1 

still admits 7r as invariant distribution. In practice, in the case of tall data, we can divide the 
dataset into B batches and use for example 

(a an = p(0') 1/B p{xi\0')q(0\0') 

1 ’ p(0) 1 / B p(xi\0)q(0'\6') 

This allows us to reject candidate 6' without having to compute the full likelihoods and the 
calculations of pi{0,0') can be done in parallel. However, as remarked by Banterle et al. 
(2015), the resulting Markov chain has a larger asymptotic variance in (4) than the original 
MH, and it does not necessarily inherit the ergodicity of the original MH. Furthermore, by 
construction, every accepted point has to be evaluated on the whole dataset, and the average 
proportion of data points used is thus lower bounded by the acceptance rate of the algorithm, 
which in practice is often around 25%. Overall, it is an easy-to-implement feature that does not 
add any bias, but its benefits are inherently limited, and speed of convergence might be affected. 
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(c) Chain hist., 10 5 k iter, at 1%, Xi ~ A/*(0,1) (d) Chain hist., 10 5 k iter, at 1%, Xi ~ log A/"(0,1) 

Figure 4: Results of SGLD (Welling & Teh, 2011) on our Gaussian and lognormal running 
examples. See Section 5.1 and the caption of Figure 2 for details. 
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6 Approximate subsampling approaches 


In this Section, we consider again subsampling approaches where, at each MH iteration, a 
subset of data points is used to approximate the acceptance ratio (5). As mentioned in Section 
4.2, it is very simple to obtain an unbiased estimator of the log-likelihood nl{9) based on 
random samples x \,..., x\ from the dataset X\ see (7). Similarly, one can also easily obtain 
an unbiased estimator of the average log likelihood ratio [£(0') — 1(9) 


A t*(M') = 


1 

t 


E 1 ^ 


pi&W) 

p{x*\e) ' 


(16) 


Note that unlike the exact approaches of Sections 4 and 5, the methods reviewed in Section 6 
do not attempt to sample exactly from it, but just from an approximation to n. 


6.1 Naive subsampling 

The first approach one could tty is to only use a random fixed proportion of data points to 
estimate n at any newly drawn 0. We highlight that this leads to a nontrivial mixture target that 
is hard to interpret, where all subsampled posteriors appear, suitably rescaled. Assume that at 
each new 0 drawn in an MH run, we draw n independent Bernoulli variables and let 

1 n 

= ( 17 ) 

n z J A 

2=1 

be an unbiased estimator of the average log likelihood £(8), where Zi ~ li(l, A) i.i.d. Now 
one could think of plugging estimates 7 (9) = p(0)e nl ^ in Steps 2 and 3 of MH in Figure 1. 
However, as 7 (6) is not an unbiased estimator of 7 (6), this algorithm samples from a target 
distribution which is proportional to p(9)Ee n ^ e ' > / 7 (6), where the expectation is w.r.t the 
distributions of the Bernoulli random variables {zi}. Now 

n 

Ee n£(6) = [a^(@) 1 /A + (i — A) 

2=1 

n 

= E E 

r =0 #I r =r 

where I r denotes a set of r distinct indices in {1 ,..., n}, xj r = {xp,i £ I r }, and with the 
convention p{x$\0) = 1. Each subsampled likelihood contributes to the target, exponentiated 
to the power 1 /A, resulting in a nontrivial mixture of rescaled data likelihood terms. To fur¬ 
ther simplify, assume p(xi r \9) ~ p r {9) for each set of indices /,.. that is, the variance of the 
likelihood under subsampling is small, then 

n 

Ee ni(e) w C r n X r ( 1 - Ar-V(0) 1/A = E^ fl(niA )P ](\9), (18) 

r=0 
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where B(n, A) denotes the binomial distribution with parameters n and A. Noticing that p r (9) 
is roughly exponentially decreasing in r, the values or r that arc larger than the mode of the bi¬ 
nomial probability mass function are unlikely to contribute a lot to (18). The largest subsample 
size contributing to (18) is thus roughly nX, and the power 1/A makes this term of the same 
scale as p(x i ,... ,x n \9). Broadly speaking, subsampling has a “broadening” effect due to the 
contribution of the likelihoods of small subsamples. 

Alternately, if one starts with the biased estimator of the average log likelihood 

1 n 

m = -Y J zMo), (i9) 

n i =0 

instead of (17) one ends up with 

n 

Ee ni(e) = JJ [A4(0) + (i - A)] 

i= 1 
n 

= EE A'(l - 

r=0 #Ir=r 

~ ^R~B(n,\)PR(9)- ( 20 ) 

Again, all subsampled likelihoods contribute, but without being further exponentiated. Still, the 
result is a much broadened target, as values of r that are larger than n\ are unlikely to contribute 
a lot. In this case, the broadening effect of subsampling is not only due to the contribution of 
small subsamples, but also to the absence of bias correction in (19). 

We have thus seen that naive subsampling is nontrivial tempering, so that the target is not 
preserved. Additionally, as mentioned in Section 4.1, the variance of the log likelihood estima¬ 
tor needs to be kept around a constant, 1 or 3 depending on hypotheses, in order for pseudo¬ 
marginal MH to be efficient. This means that A should be such that 

2=1 

is of order 1 in the case of (17). This entails that A should be close to 1, so that there is no 
substantial gain in terms of number of likelihood evaluations. In the case of (19), 

n 

A(l-A )J>(0) 2 
2=1 

can be of order 1 if A ~ n~ l . But then the leading terms in the mixture target (20) will be 
the subsampled likelihoods corresponding to small subsamples, so that the target will be very 
different from the actual target 7r. 

Overall, naive subsampling is a very poor approach. However, it allows us to identify the 
main issues a good subsampling approach should tackle: guaranteeing its target, not loosing 
too much convergence speed compared to MH, and cutting the likelihood evaluation budget. 
As shown in (Bardenet et al. , 2014), the first point is an algorithmic design issue, while the last 
two points are related to controlling the variance of the log likelihood ratios. 
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6.2 Relying on the CLT 


Several authors have appealed to the central limit theorem to justify their assumption that the 
average subsampled log likelihoods and log likelihood ratios in (7) and (16) are Gaussianly 
distributed. 

If the noise of the log likelihood ratio estimate is normal with known variance and mean 
equal to the true log-likelihood ratio, Ceperley & Dewing (1999) have proposed an MH with a 
corrected acceptance ratio that is exact, i.e., that still targets it. When the variance of the noise is 
not known, and is rather estimated, the method becomes inexact. Nicholls et al. (2012) propose 
a heuristic argument to show that this inexact chain gives reasonable approximate results, but 
the Gaussian assumption remains crucial there. As shown in (Bardenet et al. , 2014) and in this 
paper in Figures 5 and 6, this assumption can be arbitrarily violated when subsampling tall data 
if the log likelihood ratios li{9') — l%{9) are heavy-tailed. Missing log likelihoods in the tails 
will lead to erroneous decisions, and uncontrolled results. 

6.2.1 A pseudo-marginal approach with variance reduction under Gaussian assumption 

Quiroz et al. (2014) propose a methodology to use MH for tall data which also relies on the 
assumption that the log-likelihood estimator is Gaussian with mean t{9), for every 9. By intro¬ 
ducing a bias correction providing an approximately unbiased estimate of the likelihood, this 
corresponds to a pseudo-marginal MH algorithm whose target distribution is proportional to 
p(9) Ee n ^ 6l ) _ k(Q, where b(9) is an estimate of the bias b(9) satisfying Ee n ^ = e n ^)+ b ( e )_ 
They rightly notice that, ideally, if one wants to keep the variance of average subsampled log 
likelihoods small, one should not subsample data points with or without replacement, but one 
should rather perform importance sampling with the weight of data point i being proportional 
to \£i(9)\. While this variance reduction approach obviously defeats the purpose of subsam¬ 
pling, Quiroz et al. (2014) propose to use as weights an approximation of the log-likelihood, 
based e.g. on a Gaussian process or splines trained on a small subset of computed likelihoods 
ii{9). Finally, a heuristic to adaptively choose the size of the total subsample so as to keep 
the variance of the log likelihood controlled is proposed. The method is demonstrated to work 
on a bivariate probit model using only 10% of the full dataset. However, as a general purpose 
method, it suffers from two limitations. First, it is based on Gaussian assumptions, which can 
be unreasonable as noted above and it is uncle ar whether it will be robust to these CLT approx¬ 
imations not being valid. Second, the proposed importance sampling step requires to learn a 
good proxy for x i —> p(x\9) for each 9 drawn during the MCMC run. The fitted proxies should 
thus be cheap to train and evaluate, but at the same time accurate if any variance reduction is to 
be obtained. 

6.2.2 Adaptive subsampling with T-tests 

Still assuming the noise of the log likelihood is Gaussian, given a drawn 9 e 0, one can try to 
adaptively choose the size of the subsample {x \,..., xjf} to be used in the unbiased estimators 
(7) or (16), so as to take the correct acceptance decision with high probability. Upon noting that 
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the MH acceptance decision is equivalent to deciding whether log a(9, O') > u, or equivalently 


1 

n 


J2 lo g 

2—1 


P(xj\d') 

P(xi\0 ) 


> 


— log-u 
n 


i \ p(e')g(e\o') 

n p{0)q(0'\9) 


( 21 ) 


with u ~ W [0) il drawn beforehand, statistical tests can be used to assert whether (21) holds with 
a given level of “confidence”. As far as we are aware, Bulgak & Sanders (1988) were the first to 
consider such a procedure. They used it in a simulated annealing algorithm maximizing a func¬ 
tion defined as an expectation w.r.t a probability distribution, and approximated using Monte 
Carlo. Simulated annealing is a simple non-homogeneous variant of the MH algorithm where 
the the target distribution is annealed over the iterations. The same application received more 
attention later (Alkhamis et al. , 1999; Wang & Zhang, 2006). Applied to the standard MH, the 
method has been considered by Singh et al. (2012), and more recently by Korattikara et al. 
(2014) specifically for tall data. Korattikara et al. (2014) propose an MH-like algorithm called 
Austerity MH that incorporates a sequential T-test to check (21) for each pair ( 9 , 6'), thus relying 
on several CLTs. They demonstrate dramatic reductions in the average number of subsamples 
used per MCMC iteration on particular applications. However, as noted in (Korattikara et al. , 
2014; Bardenet et al. , 2014), the results can be arbitrarily far from the original MH when the 
CLT approximations are not valid. 

We show the results of 10 000 iterations of Austerity MH on our two running examples in 
Figure 5. The parameters are e = 0.05, corresponding to the p-value threshold in the afore¬ 
mentioned T-test, and an initial subsample size of 100 at each iteration. In the Gaussian case, 
the posterior is rightly centered, but is slightly too wide. This is a tempering effect due to too 
small subsamples, while the CLT-based Student approximation seems reasonable, as shown in 
Figure 6. In the lognormal case, the departure of the chain from the actual posterior is more re¬ 
markable, and relatedly the CLT approximations of Austerity MH are inaccurate for the chosen 
initial subsample size of 100, as we demonstrate in Figure 6. This explains the strong mismatch 
of the chain and the posterior in Figure 5(b). The standard deviation of the fitted Gaussian is 
largely underestimated, due to small subsamples which do not include enough of the tails of the 
log likelihood ratios, which coincide with the tails of X. Finally, the reductions in the number 
of samples needed per iteration are quite interesting: half of the iterations require less than 4% 
of the dataset for the lognormal case, but this is at the price of a large error in the posterior 
approximation. Augmenting the initial size of the subsample will likely make the CLT approx¬ 
imations tighter, but there is no generic answer as to which size to choose: any fixed choice 
will fail on an example where the log likelihood ratios have heavy enough tails. In both the 
Gaussian and the lognormal example, it is actually safer to go with the Bernstein-von Mises 
approximation, which costs little more than a run of stochastic gradient descent, and only re¬ 
quires one CLT approximation, for a sample of size n> 1. This illustrates the danger of using 
CLT-based approximations for small sample sizes, which is related to asymptotic arguments on 
small batches in Section 3. 

Overall, CLT-based approaches to MH with tall data lead to heuristics with interesting re¬ 
ductions in the number of samples used, but they have little theoretical backing so far and they 
are not robust to the involved CLT approximations being inaccurate. We note also that the CLT 
is assumed to provide a good approximation for the log likelihood or log likelihood ratio for 
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(a) Chain histograms, X\ ~ A/”(0,1) (b) Chain histograms, X ~ log JV(0, 1) 




(c) Autocorr. of log a, X,, ~ A/”(0,1) 


(d) Autocorr. of log a, X t ~ logA/”(0,1) 



(e) Num. lhd. evals, Xi ~ A/”(0,1) 



(f) Num. lhd. evals, Xi ~ logA/”(0,1) 


Figure 5: Results 10000 iterations of Austerity MF1 (Korattikara et al. , 2014). See Section 6.2 
and the caption of Figure 2 for details. 
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(a) Student around MAP, X,, ~ log JV(0, 1) (b) Student around MAP, X t ~ log JV(0, 1) 

Figure 6: Histogram of 1000 realizations of the Student statistic required in Austerity MH, 
taken at 0 = #map and O' ~ q(-\0). The theoretical Student pdf is plotted in red. 


any 0,0' £ 0, which amounts to more than one Gaussian assumption. The approaches in this 
section should thus be applied with care. As a minimal sanity-check, we recommend using tests 
of Gaussianity across 0 x 0 to make sure the CLT assumptions are realistic. Note that even 
then, there is no guarantee the above algorithms have n for target, if any. 


6.3 Exchanging acceptance noise for subsampling noise 

This section is an original contribution, which illustrates a way to obtain subsampling algo¬ 
rithms with guarantees under weaker assumptions than Gaussianity. This approach is imprac¬ 
tical, but it is of methodological and illustrative interest. First it illustrates a potentially useful 
technique to take advantage of subsampling noise. Second, it is our first illustration of the 
seemingly inevitable 0(ri) average number of subsamples required per MCMC iteration as 
soon as we do not use any CLT-based approximation and require theoretical guarantees. 

Let 0,0' £ 0, and let x\,..., x* be drawn independently with replacement from X. Let 
A t(Q,0') be the average subsampled log likelihood ratio defined in (16). Now, we remark 
that MH has some inherent noise in its acceptance decision (21), encapsulated by the uniform 
variable u ~ U[ 0,1]. Why, then, not rely on the subsampling noise to guarantee exploration, 
and accept a move if and only if 


A t *(M') + -log 

n 


' p{o')q(o\o'y 

_p(0)q(0'\0) _ 


> 0 


( 22 ) 


instead of (21)? This idea has been first used by Branke et al. (2008) to develop heuristics for 
simulated annealing in the presence of noise. We formalize this argument here in the context 
of subsampling. For the sake of simplicity, assume for a moment we have a flat prior and a 
symmetric proposal, so that (22) becomes 


A t *(M')>0. 
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We do not assume that the A t(6,6')’s arc Gaussianly distributed, but we make the para¬ 
metric assumption that the second and third absolute moments a 2 and p of — \ogp(x*\9') + 
logp(x*\9) are known and independent of 9 , O'. Applying the Berry-Esseen inequality (van der 
Vaart & Wellner, 1996) to the variables — \ogp(x*\9') + \ogp(x*\9) yields 


P(-A*(M') <«)-$ 


f u + A n (9,9') \ 

V cr /Vt ) 


K((t, p ) 

Vi 


for any aGl, where $ is the cdf of a Af(0, 1) variable, and 


A n (M') = 


1 

n 


£** 

2=1 


P(xi\9') 

p(xi\0) 


(23) 


is the average log likelihood ratio. When u = 0, (23) yields 

K(a, p) 

Vi 

Now let C, A > 0 be such that for any i£l, 


P(A*(0,6>') > 0) - T> 


A n {8,8’) 

a/Vi 


4>(x) 


1 

1 + e~ Xx 


< C. 


(24) 


Bowling et al. (2009) for instance, empirically found C = 0.0095 and A = 1.702. Combining 
this bound with (24), we obtain 


P(A t(8,0') > 0) 


1 

\A n (e,o') 

1 -)- e <T / V '* 


< C + 


K(cr, p) 

Vi 


Hence, the acceptance probability of an algorithm that would accept the move from 8 to 9' by 
checking whether A *(8,8') > 0 is close to the acceptance probability of an MCMC algorithm 
with a Baker acceptance criterion (Robert & Casella, 2004, Section 7.8.1) that targets tt' with 
temperature (5 = Arguments such as (Bardenet et al. , 2014, Lemma 3.1, Proposition 
3.2) could then help concluding that the distance between the kernels of both Markov chains 
is controlled, which would yield positive ergodicity results, in the line of (Bardenet et al. , 
2014, Proposition 3.2). This reasoning shows again a close relation between subsampling and 
tempering, as in Section 6.1, with a clear link between the variance of the subsampled log 
likelihood ratios and the temperature. 

Now, from a practical point of view, in simple applications such as logistic regression, a is 
of the order of \\0 — f)'\\. which in turn should be of order (9 p (n -1 / 2 ) if the MCMC proposal is 
a Gaussian random walk with covariance similar to that of 7r, see Bardenet et al. (2014). This 
means that t has to be of order n for the temperature (3 to be of order 1, and this approach is 
thus bound to use 0{n) subsamples per iteration! In conclusion, robustness to non-Gaussianity 
leads to requiring a fixed proportion of the whole dataset on average, even in the favourable case 
when one controls the first three moments of the subsampling noise and one swaps subsampling 
noise for the inherent MCMC acceptance noise. 
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6.4 Confidence samplers 


In (Bardenet et al. , 2014), we proposed a controlled approximation of the acceptance deci¬ 
sion (21). Indeed, let us fix 6,0' and momentarily assume that x >-)■ \og\p(x\6') /p(x\0)\ was 
Lipschitz with known constant. Then, having observed the log likelihood ratio at some points 
{x*, i = 1,..., t} C X, one could build a lower and an upper bound for the complete log 
likelihood ratio 

n 

n 

i —1 

simply by associating each X{ with the nearest point among {xj,..., x £}. These bounds could 
be refined by augmenting the set of observed log likelihoods ratios, until eventually one knows 
for sure whether (21) holds. 

Now, concentration inequalities allow softer bounds and require less than this Lipschitz 
assumption. If one knows a bound for the range 


p(xj\0) 

p{xi\9) 


/~i A ^ 

La o' = max 
i=t 


log 


P{xi\9') 

p{xi\6) 


(25) 


then concentration inequalities such as Hoeffding’s or Bernstein’s, yield confidence bounds 
ct(S) such that 
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t 




' p(x*\9'Y 

_p{x*\6)_ 


1 
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1=1 


p{xj\6') 

p{xi\6) 


> ct{$) 


> 1 




(26) 


where the probability is taken over x\,...,x\ drawn uniformly from X, with or without re¬ 
placement. Borrowing from the bandit literature, we explain in (Bardenet et al. , 2014) how to 
leverage such confidence bounds to automatically select a subsample size T such that the right 
MH acceptance decision is taken with a user-specified probability 1 — 5. Note that for our algo¬ 
rithm to bring any improvement over the ideal MH, the range (25) must be cheap to compute, 
i.e. cheaper than 0(n). This is the case for logistic regression, for example, but it is the major 
limitation of the approach in Bardenet et al. (2014). We showed in (Bardenet et al. , 2014, 
Proposition 3.2) that if the ideal MH sampler is uniformly ergodic then the resulting algorithm 
inherits the uniform ergodicity of the ideal MH sampler, with a convergence speed that is within 
0(5) of that of the ideal MH. Importantly, we showed that our sampler then admits a limiting 
distribution, which is also within 0(5) of tv. Uniform ergodicity is a very strong assumption 
and it would be worth extending these results to the geometrically ergodic scenario. There has 
recently been work in this direction (Alquier et al. , 2014; Pillai & Smith, 2014; Rudolf & 
Schweizer, 2015). 

On the negative side, we demonstrated in (Bardenet et al. , 2014) that vanilla confidence 
samplers still require 0(n) samples at each iteration at equilibrium, where the proportionality 
constant is the variance of the log likelihood ratio under subsampling. This statement relies on 
the leading term in ct(5) being of order V '/ 2 . In practice, the results of the vanilla confidence 
sampler on our running examples are shown in Figure 7. We set 5 = 0.1 and we place ourselves 
in the favourable scenario where the algorithm has access to the actual range of each log like¬ 
lihood ratio. The number of likelihood evaluations is estimated as follows: we take by default 
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twice the detected value T for the subsample size in general, but only once when the previous 
iteration required computing all n likelihoods at the current state of the chain. Still, even in 
these favourable conditions, the algorithm basically requires essentially the whole dataset at 
each iteration. 

Concentration inequalities are “worst-case” guarantees, and the theoretical results come at 
the price of a smaller reduction in the number of samples required. When the target is locally 
Gaussian, e.g. when Bernstein-von Mises yields a good approximation, there is potentially a 
lot to be gained, as empirically demonstrated by Korattikara et al. (2014), for example. In 
the current paper, we propose in Section 7 a modified confidence sampler that can leverage 
concentration of the target to yield dramatic empirical gains while not sacrificing any theoretical 
guarantee of the confidence sampler. The basic tool is a cheap proxy for the log likelihood ratio 
that acts as a control variate in the concentration inequality (26). Using a 2nd order Taylor 
expansion centered at the maximum of the likelihood - obtained with a stochastic gradient 
descent for example - allows to replace many likelihood evaluations by the evaluation of this 
Taylor expansion. Figure 8 shows the results of this new confidence sampler with proxy on our 
running Gaussian and lognormal examples. Our algorithm outperforms all preceding methods, 
using almost no sample in the Gaussian case where it automatically detects that a quadratic 
form is enough to represent the log likelihood ratio. Finally, we demonstrate in Sections 7.2.3 
and 8 that this new algorithm can require less than 0(n) likelihood evaluations per iteration. 
Combined with the statements in Bardenet et al. (2014) that each iteration is almost as efficient 
as the ideal MH, which is further supported by the match of the autocorrelation functions in 
Figures 8(c) and 8(d), this opens up big data horizons. We give full details on the confidence 
algorithm with proxy in Section 7. 


7 An improved confidence sampler 

In this section, we build upon the confidence sampler in (Bardenet et al. ,2014) by introducing 
likelihood proxies, which act as control variates for the individual likelihoods. 


7.1 Introducing proxies in the confidence sampler 

We start by recalling the pseudocode of the confidence sampler in (Bardenet et al. , 2014) in 
Figure 9, using sampling with replacement and a generic empirical concentration bound 
In practice, one can think of the empirical Bernstein bound of Audibert et al. (2009) 




S' 


21og(3/<5) 667^/log(3/5) 

, I” , * 


where o tt e,8' is d lc sample standard deviation of the log likelihood ratio 


(27) 


log 


p(x* \e’) 

p{x*\9) 


= 1 , 




and Co jy is their range, defined in (25). We emphasize that other choices of sampling procedure 
and concentration inequalities are valid, as long as they guarantee a concentration like (26). We 
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Figure 7: Results of 10 000 iterations of the vanilla confidence sampler (Bardenet et al. , 2014), 
see Section 6.4 and the caption of Figure 2 for details. 
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Figure 8: Results of 10000 iterations of the confidence sampler of Section 7 with a single 2nd 
order Taylor proxy at $map- 
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refer the reader to (Bardenet et al. , 2014) for a proof of the correctness of the confidence 
sampler and implementation details. 


Confidences ampler (p(x|0), p(6), q(9'\9), 9 0 , N iteI , X, (S t ), C e ,e>,) 

1 

for k <— 1 to ./Viter 




2 

9 9k ~i 




3 

9' ~ q{-\9), u ~ W(o,i)> 



4 

ip(u, 9,9') <r- Mog 

L P(e)q(8' |fl)l 

l U p(8')q(e |0')J 


5 

t 4 — 0 




6 

Oook 0 




7 

A* t- 0 




8 

X* 0 > Keeping track of points already used 

9 

b <— 1 l> Initialize batchsize to 1 


10 

Done «- False 




11 

while Done == False do 



12 

X* +1 ,...,X* b ~ w /repl. X\X* 

> Sample new batch with replacement 

13 

X*^X*U{x* t+1 ,...,x* b } 



14 

A *^ i (fA*+Et i+ ilog 

'p{x*\8') 

p{x*\8) 

) 

15 

14 — b 




16 

C F- Ct{S tlooii ) 




17 

^look ^ ^look 1 




18 

b t— n A [7f] > Increase batchsize geometrically 

19 

if A* — ip(u, 9,9') > c or b > 

n 

20 

Done True 




21 

if A* > i!>{u,9,ff) 




22 

9k <— 9' > Accept 




23 

else 9k ^—0 0 Reject 



24 

return {9 k ) k = i,...,jv ittr 





Figure 9: Pseudocode of the confidence MF1 from (Bardenet et al. , 2014). Our contribution 
is a modification of Steps 14, 19 and 21 to introduce proxies for the log likelihood ratios, see 
Section 7. 

The bottleneck for the performance of the confidence sampler was identified in (Bardenet 
et al. , 2014) as the expectation w.r.t. iT(O)q(0'\O) of the variance of the log likelihood ratio 
\ogp{x\9') / p(x\0) w.r.t. to the empirical distribution of the observations. We now propose a 
control variate technique inspired from the Firefly MH of MacLaurin & Adams (2014) to lower 
this variance down when an accurate and cheap proxy of the log likelihood is known. 

We require a proxy for the log likelihood ratio that may decrease the variance of the log 
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likelihood ratio or its range. More precisely, let pi(6, 9') be such that for any 9,9' E 0, 

1. Pi(M') « 4(00 - 4(0) 

2. ]E" =1 pi(9, 9') can be computed cheaply. 

3. | £i(9') — £i(9) — pi(9,9')\ can be bounded uniformly in 1 < i < n, and the bound is 
cheap to compute. 

We now simply remark that the acceptance decision (21) in MH is equivalent to checking 
whether 


n 

-T 

n ^ 


i =1 L 


i „v{xi\9 j 

log / - Pi (0,0 ) 

P[Xi\9) 


N i, i, b(0 , M0|0 , )l i 

> log U log , . . -x 

n n lp{9)q(9‘ \9) J n " 


22g>i(0,0'). (28) 


Building the confidence sampler on (28) leads to the same pseudocode as in Figure 9, except 
that Step 14 is replaced by 


A* 




i=t +1 L 


log dS!E- Pi (M') 

P«|0) 


the condition in Step 19 is replaced by 


1 


A * + - Pi( 0 , 0 ') - 0 , 0 ') 


2—1 


> 


and the condition in Step 21 becomes 


A* >^(u,9,9') --y2pi(9,0'). 

n 0-' 


i=l 


For completeness, we restate here in Proposition 7.1 that the vanilla confidence sampler 
inherits the uniform ergodicity of the underlying MH sampler, that its target is within 0(6) 
of 7T, and that the difference in speed of convergence is also controlled by 5. Let P be the 
underlying MH kernel, and P the kernel of the confidence sampler described in this section. 


Proposition 7.1 Let P be uniformly geometrically ergodic, i.e., there exists an integer m, a 
probability measure v on (0,73(0)) and 0 < p < 1 such that for all 9 E 0, P m (9,-) > 
(1 — p) u(-). Hence there exists .4 < oc such that 

V9 E 0, Vk > 0, \\P k (9,-)-ir\\TV<M k/ml - (29) 

Then there exists B < oc and a probability distribution if on (0, B (0)) such that for all 9 E 0 
and k > 0, 

||P fc (0,-)-7f||7v< B[l- ( l - 5 ) m ( l - p )] L fc / m J . ( 30 ) 

Furthermore, ft satisfies 

Am5 

\\tt - tt\\tv < z -• ( 31 ) 

1 ~P 
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Even in the presence of proxies, the proofs of (Bardenet et al. , 2014, Lemma 3.1, Proposi¬ 
tion 3.2) apply with straightforward modifications, so that we can extend Proposition 7.1 to the 
proxy case. The major advantage of this new algorithm is that the sample standard deviation 
and range Cg$i in the concentration inequality (27) are replaced by those of 


log 


P(x* |0 A 


.P(x*\0) _ 


- = 1 ,... ,t 


If pi(9, 9’) is a good proxy for the log likelihood ratio, one can thus expect significantly more 
accurate confidence bounds, leading in turn to reduction in the number of samples used. 


7.2 An example proxy: Taylor expansions 

In general, the choice of proxy p will be problem-dependent, and the availability of a good 
proxy at all is a strong assumption, although not as strong as our previous requirement in Bar¬ 
denet et al. (2014) that the range (25) can be computed cheaply, which basically corresponds 
to pi(9, 9’) being identically zero for all i. Indeed, we show in this section that all models that 
possess up to third derivatives can typically be tackled using Taylor expansions as proxies. In 
Section 8 , we detail the case of logistic regression and gamma linear regression. 


7.2.1 Taylor expansions 


We expand 4 around some reference value 9* to obtain an estimate 

4(0) = 4(0*) + <£*(0 - 0*) H-1(0 - 0*) T i4,*(0 - 0*), 

where < 7 ^* and H lir are respectively the gradient and the Hessian of at 0*. The choice of 
0* is deferred to Section 7.2.2. Let us now define pi(9,9') = h{9 r ) — 4(0). The average 
n X)iLi Pi(9,9') can be computed in 0(1) time if one has precomputed 


P = 


n 

- 9i •* 


i=l 


and 

Indeed, the following holds 
1 


n ' 


n 




i=l 


ft T (9' - 9) + i(0' - 9) T S(9 + 9’ - 20* 


i— 1 


Linally, assuming 


dli 

90(7)90( fe )90(0 e 


can be bounded uniformly in i,j, k, l, the absolute difference 4(00 — 4(0) _ Pi(9, 9') can be 
bounded using the Taylor-Lagrange inequality. To conclude, all conditions of Section 7.1 are 
satisfied by the proxy pi. 
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7.2.2 Drop proxies along the way 

When the mass of the posterior is concentrated around the maximum likelihood estimator #mle> 
a single proxy - say a Taylor proxy centered at 9+ = #mle - can represent the target quite 
accurately. This is the proxy we used in the running examples of Section 6, see Figure 8. 
When the posterior does not concentrate, or the proposal is not local enough, such a proxy will 
be inaccurate, potentially resulting in insufficient subsampling gains. There are various tricks 
that can be applied. One can either precompute proxies across 0 if one has an idea where 
the modes of it are, and then use the closest proxy to the current state of the chain at each 
iteration. Alternately, if one agrees to look at the whole dataset every a iterations, we can 
drop proxies along the way, i.e. set 0* to the current state of the chain every a MH iterations. 
The whole dataset needs to be browsed at each change of the reference point 0*, since there is 
typically some preprocessing to do in order to compute later bounds. In the case of 2nd order 
Taylor expansions, for example, one has to compute the full gradient, Hessian, and any other 
quantity needed to bound the third derivatives. What the user should aim at is to sacrifice a 
proportion a of the budget of the ideal MH to make the remaining iterations cheaper. The proof 
of Proposition 7.1 easily generalizes to the case of proxies dropped every constant number of 
iterations. We demonstrate the empirical performance of such an approach in Sections 8.1.3 
and 8.2.2. 

7.2.3 A heuristic on the subsampling gain 

In (Bardenet et al. , 2014), we presented a heuristic that showed the original confidence sampler 
required 0{n) likelihood evaluations per iteration. At the time, it seemed every attempt at 
marrying subsampling and MH was fundamentally 0(n). We first repeat here the heuristic 
from (Bardenet et al. , 2014), before arguing that the contributions of this paper can lower this 
budget to o(n), even 0(1 ) up to polylogarithmic factors in very favourable conditions. 

Assuming a symmetric proposal and a flat prior, the stopping rule of the while loop in the 
original confidence sampler in Figure 9 is met whenever 



is of the same order as the confidence bound ct(5), that is, when c/ (d) is of order l/n. We 
consider the Bernstein bound in (27) and we assume the range C t fi,e' grows with n strictly 
slower than yfn. The latter assumption is realistic: Ctfifi' is often dominated by some power of 
max ||Xj || oo, and if xi, ..., x n are drawn i.i.d. from a subgaussian distribution, thenE max" =1 x$ 
0(y/logn) (Cesa-Bianchi & Lugosi, 2006, Lemma A. 13). The leading term of the Bernstein 
bound is proportional to dt,e,6'/Vt . In simple models such as logistic regression, dt,6,d' is 
proportional to \\Q — 0'\\. 

Assuming n is large enough that standard asymptotics apply and the target is approximately 
Gaussian, the results of Roberts & Rosenthal (2001) lead to choose the covariance matrix of the 
proposal such that || 9 — 9'\\ is of order n -1 / 2 . Summing up, we exit the while loop when 



n s/tyfn 
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which leads to t ~ n. 

Now consider the confidence sampler with second-order Taylor proxies introduced in Sec¬ 
tion 7.2.1. at,0,0' an d Ctfifi' now correspond to the standard deviation and range of 


log 


P(x*\6') 

_p{x*\9) 


-pI(M') 


; 1 < i < t 


Now let us assume the third-order derivatives at the reference point 0* can be bounded, say 
by some constant times max, | X, 11 as will be the case for the exponential family models of 
Section 8. Then ertfifi’ an d Ci.o.o 1 are dominated by 

max IIJSQH^, (||0 - 0*|| 3 + \\6' - ^|| 3 ) . (32) 

i 


But 11 9 — 0*|| is of order rG 1 / 2 if standard asymptotics (van der Vaart, 2000) yield good approx¬ 
imations and is set to the maximum of the posterior. Alternatively, if one has implemented 
the strategy of dropping proxies regularly, then || 9 — 0*|| should be of order n -1 / 2 since we 
assume the covariance matrix of the proposal distribution is of order 1/n. Again assuming that 
max* grows, say, like p(n) = o(n 1 / 3 ), we now exit the while loop when 

1 p(n ) 3 o(l) 

— r\_/ - - < 

n y/tn 3 / 2 y/ty/n 

Thus, when the target is approximately Gaussian and the chain is in the mode, the cost in 
likelihood evaluations per iteration of the confidence sampler with proxy is likely to be o(n). 
The actual order of convergence depends on the rate of growth of the bounds on the third 
derivatives. For example, in the case of independent Gaussian data and still assuming (32), we 
have t = 0(1) up to polylogarithmic factors. 


8 Experiments 

As a proof of concept, all experiments in this section avoid loading the dataset or proxy-related 
quantities into memory by building, maintaining and querying from a disk-based database using 
SQLite 2 . 

8.1 Logistic regression 

8.1.1 A Taylor proxy for logistic regression 

In logistic regression, the likelihood is defined by £i(9) = <f>(ti,xf9), where 

4>{z) = -log(l + e~ z ) 

and the label ti is in {—1, +1}. We can use the Taylor expansion proxy of Section 7.2.1, using 

Qi ,* = <j)'(tixj 9*)tiXi 

2 http://www. sqlite.org/ 
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Figure 10: Results of 10 000 iterations of confidence MH with a single Taylor proxy, applied to 
a synthetic logistic regression dataset vs. n 
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\£i(e')-£i(9) - Pi(9,9 ')I 

< max ||xj|| 3 {\\9 - 0*|| 3 + \\9' - 6>*|| 3 } . 

Z4 2—1 

8.1.2 A toy example that requires 0(1) likelihood evaluations 

In this section, we consider the simple two-dimensional logistic regression dataset in (Bardenet 
et al. , 2014, Section 4.2.2), where the features within each class are drawn from a Gaussian. 
The dataset is depicted in Figure 10(a). We consider subsets of the dataset with increasing 
size log 10 n G {3,4, 5, 6, 7}, run a confidence MH chain for each n, started at the MAP, with 
6 = 0.1 and a single proxy around the MAP. We report the numbers of likelihood evaluations 
L at each iteration in Figure 10(b). The fraction of likelihood evaluations compared to MH 
roughly decreases by a factor 10 when the size of the dataset is multiplied by 10: the number of 
likelihood evaluations is constant for n large enough. In other words, 1 000 random data points 
at each iteration are enough to get within 0(6) of the actual posterior, the rest of the dataset 
appears to be superfluous. There is a saturation phenomenon. By relaxing the goal of sampling 
from 7r into sampling from a controlled approximation, we can break the O(n) bander and in 
this particular example reach a cost per iteration of 0(1). 
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(b) Fraction of likelihood evaluations 


Figure 11: Results of 5 runs of a confidence sampler with Taylor proxies dropped every 10 
iterations, applied to logistic regression on covtype. In Figure 11(a), a solid line corresponds 
to the online posterior mean of the 1st component of the chain vs. the budget of MH, while a 
dashed line of the same color corresponds to the budget of the confidence sampler. 


8.1.3 The covtype dataset 

We consider the dataset covtype.binary* described in Collobert et al. (2002). The dataset 
consists of 581,012 points, of which we pick n = 400,000 as a training set, following the 
maximum training size in Collobert et al. (2002). The original dimension of the problem is 
54, with the first 10 attributes being quantitative. To illustrate our point without requiring a 
more complex sampler than MH, we only consider the 10 quantitative attributes. We use the 
preprocessing and Cauchy prior recommended by Gelman et al. (2008). 

We run 5 independent chains for 10 000 iterations, dropping proxies every 10 iterations as 
explained in Section 7.2.2. We obtain a Gelman-Rubin statistic of 1.01 (Robert & Casella, 
2004, Section 12.3.4), which suggests the between-chain variance is low enough that we can 
stop sampling. 

We estimate the number of likelihood evaluations L at MH iteration k as follows. First, 
note that -dropping proxies or not- on a regular iteration where the proxy is not necessarily 
recomputed, L *. can take values up to 2 n, unlike MH, which can store the evaluation of the 
likelihood at the current state of the chain from one iteration to the next, and thus only requires 
n likelihood evaluations per iteration. Second, at an iteration where the proxy is recomputed, 
the whole data has to be read anyway, so that we choose here to perform a normal MH iteration. 
This requires the maximum 2 n likelihood evaluations, Assuming the cost of the likelihood 
evaluation is the bottleneck, we neglect here the additional cost of computing the proxy itself, 
and only report L & = 2 n when the proxy is recomputed. Third, whenever we compute the full 
likelihood at a state of the chain, we store it until the chain leaves that state, similarly to any 
implementation of MH. Thus, at an iteration that follows a full read of the data, i.e. L^-i = 2 n, 

’available at http://www.csie.ntu.edu.tw/ cjlin/libsvmtools/datasets/binary.html 
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we only count the likelihood evaluations of the proposed state. 

We summarize the results in Figure 11. All runs use on average 27 to 42% of n likelihood 
evaluations per iteration. Since we compute the proxy every a = 10 iterations, there is a 
necessary 2x10 = 20% of n that is due to recomputing the proxy. We manually assessed the 
value of a, and recomputing the proxy less often increases the average number of likelihood 
evaluations (not shown). Thanks to these forced 20%, the rest of the iterations are considerably 
cheaper than n, since 50% of the iterations require less than 5% of the dataset, as shown in 
Figure 11(b). Relatedly, although subsampling implies a forced 2 n likelihood evaluations to 
start and thus shows an initial delay in Figure 11(a), it quickly catches up and converges faster. 
The gains arc two- or threefold, which is of limited overall practical interest, but we know from 
Section 7.2.3 and Figure 10(b) that increasing n will also improve the gain. 


8.2 Gamma linear regression 

8.2.1 A Taylor proxy for gamma regression 

In gamma regression, the nonnegative response y is assumed to be gamma-distributed 


y ~ r 



K 


where F(k, s) is the gamma distribution with shape parameter k and scale parameters s. As¬ 
suming k is known, the log likelihood is thus given by 

logp(y|x,0) oc —nye~ e x — k9 t x 

up to an additive constant, so that 

Vlogp(y|x,0) = k (ye~ xTe — 1^) x, 


and 


<90O')<90( fc )<90« 

Furthermore, we can bound 

<9 


Fless(logp(y|x, 0)) = — nye x °xx T 

logp(y\x,9) = Kye~ xT 6 x^ x^ x^. 




2=1 


< Kmax|y|exp ( — minxfO I max \\xi 


2=1 


l<2<n 


I 3 

loo* 


The Taylor proxies of Section 7.2.1 can thus be applied. 


8.2.2 The covtype dataset 

As an application, we consider the covtype dataset again and regress the nonnegative feature 
“horizontal distance to nearest wildfire ignition” onto the other quantitative features. We run 5 
independent chains for 10 000 iterations, dropping proxies every 10 iterations as explained in 
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Figure 12: Results of 5 runs of a confidence sampler with Taylor proxies dropped every 10 iter¬ 
ations, applied to gamma linear regression on covtype. In Figure 12(a), a solid line corresponds 
to the online posterior mean of the 1st component of the chain vs. the budget of MH, while a 
dashed line of the same color corresponds to the budget of the confidence sampler. 


Section 7.2.2. We obtain a Gelman-Rubin statistic of 1.001, which again suggests we can stop 
sampling. We estimate the evaluation budget as in Section 8.1.3. We summarize the results in 
Figure 12. 

All runs use on average 33 to 54% of n likelihood evaluations per iteration, from which 
2 x 10 = 20% are due to recomputing the proxy every 10 iterations. Recomputing the proxy 
less often increases the average number of likelihood evaluations (not shown). Thanks to these 
forced 20% the rest of the iterations are considerably cheaper than n, since, as in Section 8.1.3, 
50% of the iterations require less than 10% of the dataset. Relatedly, and similarly to the logistic 
regression task in Section 8.1.3 subsampling converges two or three times faster in this example. 
Again, this is a proof of concept that subsampling works, and we know from Section 7.2.3 and 
Figure 10(b) that increasing n will also improve the gain. 

9 Discussion 

We have reviewed recent advances in applying MCMC to tall datasets. Divide-and-conquer 
approaches have yet to solve the recombination problem, i.e. how to obtain a meaningful distri¬ 
bution in a stable manner from the output of individual chains on a growing number of smaller 
datasets. Subsampling approaches face different issues, namely that of approaching the right 
target at a known speed, and of keeping the overall budget in terms of likelihood evaluations 
per iteration low. 

In this paper, we have proposed an original subsampling approach. We have showed that 
under strong ergodicity assumptions on the original MH sampler, our algorithm samples from 
a controlled approximation of the posterior target. While these strong assumptions are rarely 
satisfied in practice, our experiments suggest that our results extend to more general scenarios. 
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In terms of scaling, the introduced methodology is even able to lower the natural cost of 0(n) 
subsamples per iteration to as low as 0(1 ) in favourable scenarios. However, we have yet only 
observed these dramatic gains in contexts where the Bernstein-von Mises approximation is al¬ 
ready excellent. On the positive side, our algorithm improves on other proposed subsampling 
approaches in this context. On the negative side, computing the Bernstein-von Mises approx¬ 
imation for regular models can be typically achieved in only a couple of passes over the data, 
using for example stochastic gradient to compute the maximum likelihood estimator, and the 
observed information matrix at this point to estimate the Hessian. 

Further work should thus now focus on demonstrating the applicability of subsampling 
approaches to cases where it is either difficult to compute Bernstein-von Mises even if it is a 
good approximation (Chernozhukov & Hong, 2003), or - more importantly - cases where n is 
not big enough that Bernstein-von Mises yields a good approximation. 
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Appendix A: proof of Proposition 4.1 

Define 



(33) 


k =1 j=l 


By construction and the monotone convergence theorem, KS n ES = e n ^ e \ where 



k =1 j =1 


From (Rhee & Glynn, 2013, Theorem 1), the second moment of Y is 



with the convention ,S_i = 0 and Sq = e nat ' 0) . We note that 



p=k+lq=k+l^ ^' u=l v=l 
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and by definition of N, it comes 

Vary 


c 2 n£(0) 


> e -2n(<(9)-a(«)) ^ _ ! 


k=0 


(2fc + l)! 


g-2n(*(0)-a(0)) 
2\/ ^.n 

g-2 n(+0)-a(0)) 


sinh(2y / H^) - 1 
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- 1 


4\/ ^-n 

g-2n(+0)-a(0))+2ny / (l+e)[crt(0) 2 +(£(0)—a(0)) 2 ] 

nV(l + e)[a t (0)2 + (£(0)-a(0)H 


+ 0 ( 1 ). 
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Appendix B: proof of Proposition 4.2 


We write 


n n 


Var 2 ^logpfol*,*) = ^ j^lE log 2 p(xj\0, zfj - (Elogp(xj\9, Zj)f 


. 1=1 J Z—1 



e 4(0) __ e bi(6) 



2 


(1 - Iq) log --- - - + I() log 


1-Jo 



2 


n r t 

hif - h) lo S 2 x ( e ^ (0) 
«=1 L 


(AV)-bi(0) 
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